Wine Quality Prediction

WINE!!!

Through this project we embark on an exhilarating journey through the world of wine, where we explore the delicate art of predicting wine quality. With each sip and swirl, we delve into the enchanting realm of data analysis and visualization, uncovering the hidden gems within the chemical attributes that define wine excellence.


LOADING THE DATASETS

Head
type fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol quality
white 7.0 0.27 0.36 20.7 0.045 45 170 1.001 3.00 0.45 8.8 6
white 6.3 0.30 0.34 1.6 0.049 14 132 0.994 3.30 0.49 9.5 6
white 8.1 0.28 0.40 6.9 0.050 30 97 0.995 3.26 0.44 10.1 6
white 7.2 0.23 0.32 8.5 0.058 47 186 0.996 3.19 0.40 9.9 6
white 7.2 0.23 0.32 8.5 0.058 47 186 0.996 3.19 0.40 9.9 6
Tail
type fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol quality
6493 red 6.2 0.600 0.08 2.0 0.090 32 44 0.995 3.45 0.58 10.5 5
6494 red 5.9 0.550 0.10 2.2 0.062 39 51 0.995 3.52 NA 11.2 6
6495 red 6.3 0.510 0.13 2.3 0.076 29 40 0.996 3.42 0.75 11.0 6
6496 red 5.9 0.645 0.12 2.0 0.075 32 44 0.996 3.57 0.71 10.2 5
6497 red 6.0 0.310 0.47 3.6 0.067 18 42 0.996 3.39 0.66 11.0 6

We have succesfully loaded the dataset. We can see that there are 6497 observations and 13 variables.

Next lets look at the summary statistics.


SUMMARY STATISTICS

Table: Statistics summary.
type fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol quality
Min Length:6497 Min. : 3.80 Min. :0.08 Min. :0.000 Min. : 0.6 Min. :0.009 Min. : 1.0 Min. : 6 Min. :0.987 Min. :2.72 Min. :0.22 Min. : 8.0 Min. :3.00
Q1 Class :character 1st Qu.: 6.40 1st Qu.:0.23 1st Qu.:0.250 1st Qu.: 1.8 1st Qu.:0.038 1st Qu.: 17.0 1st Qu.: 77 1st Qu.:0.992 1st Qu.:3.11 1st Qu.:0.43 1st Qu.: 9.5 1st Qu.:5.00
Median Mode :character Median : 7.00 Median :0.29 Median :0.310 Median : 3.0 Median :0.047 Median : 29.0 Median :118 Median :0.995 Median :3.21 Median :0.51 Median :10.3 Median :6.00
Mean NA Mean : 7.22 Mean :0.34 Mean :0.319 Mean : 5.4 Mean :0.056 Mean : 30.5 Mean :116 Mean :0.995 Mean :3.22 Mean :0.53 Mean :10.5 Mean :5.82
Q3 NA 3rd Qu.: 7.70 3rd Qu.:0.40 3rd Qu.:0.390 3rd Qu.: 8.1 3rd Qu.:0.065 3rd Qu.: 41.0 3rd Qu.:156 3rd Qu.:0.997 3rd Qu.:3.32 3rd Qu.:0.60 3rd Qu.:11.3 3rd Qu.:6.00
Max NA Max. :15.90 Max. :1.58 Max. :1.660 Max. :65.8 Max. :0.611 Max. :289.0 Max. :440 Max. :1.039 Max. :4.01 Max. :2.00 Max. :14.9 Max. :9.00
NA NA NA’s :10 NA’s :8 NA’s :3 NA’s :2 NA’s :2 NA NA NA NA’s :9 NA’s :4 NA NA

A quick look at the summary tells us the inter-quartile ranges and maximum and minimum values for each variable.

Observations:

  1. There are also some NA values which we need to remove.

  2. There might be duplicates in the dataset.


CLEANING THE DATA

Removing Duplicates
Table: Statistics summary.
type fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol quality
Min Length:5329 Min. : 3.80 Min. :0.08 Min. :0.000 Min. : 0.6 Min. :0.009 Min. : 1.0 Min. : 6 Min. :0.987 Min. :2.72 Min. :0.22 Min. : 8.0 Min. :3.0
Q1 Class :character 1st Qu.: 6.40 1st Qu.:0.23 1st Qu.:0.240 1st Qu.: 1.8 1st Qu.:0.038 1st Qu.: 16.0 1st Qu.: 75 1st Qu.:0.992 1st Qu.:3.11 1st Qu.:0.43 1st Qu.: 9.5 1st Qu.:5.0
Median Mode :character Median : 7.00 Median :0.30 Median :0.310 Median : 2.7 Median :0.047 Median : 28.0 Median :116 Median :0.995 Median :3.21 Median :0.51 Median :10.4 Median :6.0
Mean NA Mean : 7.22 Mean :0.34 Mean :0.319 Mean : 5.1 Mean :0.057 Mean : 30.1 Mean :114 Mean :0.995 Mean :3.22 Mean :0.53 Mean :10.6 Mean :5.8
Q3 NA 3rd Qu.: 7.70 3rd Qu.:0.41 3rd Qu.:0.400 3rd Qu.: 7.5 3rd Qu.:0.066 3rd Qu.: 41.0 3rd Qu.:154 3rd Qu.:0.997 3rd Qu.:3.33 3rd Qu.:0.60 3rd Qu.:11.4 3rd Qu.:6.0
Max NA Max. :15.90 Max. :1.58 Max. :1.660 Max. :65.8 Max. :0.611 Max. :289.0 Max. :440 Max. :1.039 Max. :4.01 Max. :2.00 Max. :14.9 Max. :9.0
NA NA NA’s :10 NA’s :8 NA’s :3 NA’s :2 NA’s :2 NA NA NA NA’s :9 NA’s :4 NA NA

Duplicates can cause issues later with biases. Thus, we have removed the duplicate values.

Now, the dataset has 5329 observations.

Removing NA Values
Table: Statistics summary.
type fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol quality
Min Length:5295 Min. : 3.80 Min. :0.080 Min. :0.000 Min. : 0.6 Min. :0.009 Min. : 1 Min. : 6 Min. :0.987 Min. :2.72 Min. :0.220 Min. : 8.0 Min. :3.0
Q1 Class :character 1st Qu.: 6.40 1st Qu.:0.230 1st Qu.:0.240 1st Qu.: 1.8 1st Qu.:0.038 1st Qu.: 16 1st Qu.: 74 1st Qu.:0.992 1st Qu.:3.11 1st Qu.:0.430 1st Qu.: 9.5 1st Qu.:5.0
Median Mode :character Median : 7.00 Median :0.300 Median :0.310 Median : 2.7 Median :0.047 Median : 28 Median :116 Median :0.995 Median :3.21 Median :0.510 Median :10.4 Median :6.0
Mean NA Mean : 7.22 Mean :0.344 Mean :0.319 Mean : 5.1 Mean :0.057 Mean : 30 Mean :114 Mean :0.995 Mean :3.22 Mean :0.533 Mean :10.6 Mean :5.8
Q3 NA 3rd Qu.: 7.70 3rd Qu.:0.410 3rd Qu.:0.400 3rd Qu.: 7.5 3rd Qu.:0.066 3rd Qu.: 41 3rd Qu.:154 3rd Qu.:0.997 3rd Qu.:3.33 3rd Qu.:0.600 3rd Qu.:11.4 3rd Qu.:6.0
Max NA Max. :15.90 Max. :1.580 Max. :1.660 Max. :65.8 Max. :0.611 Max. :289 Max. :440 Max. :1.039 Max. :4.01 Max. :2.000 Max. :14.9 Max. :9.0

After, removing the NAs, we are finally left with 5295 observations. The dataset is now clean.

Now, the data is clean. We have removed the NAs, and duplicates. We can now start our exploratory data analysis.

Observations:

  1. There is a big difference between maximum value and Q3 values. This means that there are a quite a lot of outliers.

Let’s visualize this using plots.


UNIVARIATE PLOTS

To understand in depth as to which factor effects the quality the most.

Let’s look at the individual variables in the data set.

Observations:

  1. Wine quality shows a rather symmetrical distribution.

  2. Most wines have a quality score of 6.

  3. No wine achieved the highest score of 10 and the worst wines got a rating of 3.

Let’s see how the other factors are effecting!

Observations:

  1. Most distributions encountered during the exploration of the parameters looked rather usual. In general, they were positively skewed with a narrow main peak.

We will also have a look at at the box plots for these!

p1 <- ggplot(data = wine, aes(x = "", y = fixed.acidity )) +
  geom_boxplot(color = 'black', fill = I('white')) +
  labs(
    x = "Fixed Acidity",
    y = "TaOH Concentration [g/L]"
  )

p2 <- ggplot(data = wine, aes(x = "", y = volatile.acidity)) +
  geom_boxplot(color = 'black', fill = I('white')) +
  labs(
    x = "Volatile Acidity",
    y = "AcOH Concentration [g/L]"
  )

 p3 <- ggplot(data = wine, aes(x = "", y = citric.acid)) +
  geom_boxplot(color = 'black', fill = I('white')) +
  labs(
    x = "Citric Acidity",
    y = "Concentration [g/L]"
  )

 p4 <- ggplot(data = wine, aes(x = "", y = pH))+
  geom_boxplot(color = 'black', fill = I('white')) +
  labs(
    x = "pH",
    y = "pH"
  )
grid.arrange(p1,p2, p3,p4, nrow = 1)

Observations:

  1. Residual sugar has a very long-tail distribution with many outliers. It will be interesting to see how these outliers affect the quality of wine.

  2. Chlorides have distribution similar to residual sugar and have a strong concentration around the median. We also note a lot of outliers from the box plot.

  3. Most wines have less than 11% alcohol.

  4. Density has a very normal looking distribution with most of the values falling between 0.995 and 1.


SUMMARY OF UNIVARIATE PLOTS

  1. In general, the variables were positively skewed with a narrow main peak.

  2. Most wines have a pH of 3.2. Since we have chlorides,citric acid, and fixed and volatile acidity, the wines were bound to be on the acidic side.

  3. The wines have an alcohol content ranging between 8 and 15 vol%.

Now let us see that how different factors are related to the quality!


CORRELATION MATRIX

First, we will build correlation matrix to identify the variables which influence quality the most.

Observations:

Our Target Variable is Quality, so we will focus on only those parameters which influence quality.

  1. Alcohol and quality have a high positive correlation.

  2. Density and quality have a negative correlation.

  3. Low correlation between quality and chloride concentration.

  4. Wine quality also slightly negatively correlates with volatile acidity.

  5. It has a slight positive correlation with citric acid.

From the correlation matrix, let us now see how these parameters vary over different quality ratings for red and white wine separately and find out how they are different.


SUBSETTING DATA INTO RED AND WHITE WINEs

We will now subset the data into white and red wine separately to go ahead with our analysis.


BIVARIATE PLOTS

Through our correlation plot we have understood that Alcohol content, density, citric acid and chloride are the ones that are affecting the quality the most, let us see how and also make a comparitive analysis between red and white wine individually!

1. ALCOHOL VS QUALITY

The boxplot shows that wines with higher quality seem to have a higher alcohol content.

T-Test

WELCH TWO SAMPLE T-TEST

NULL HYPOTHESIS (H0): There is no significant mean difference between red and white wine in alcohol content.

ALTERNATE HYPOTHESIS (H1): There is significant mean difference between the two wines.

t_test_alcohol <- t.test(red$alcohol,white$alcohol, level=0.05)

p <- t_test_alcohol$p.value

Observations:

  1. Since the p-value of 3.606^{-6} is less than 0.05, we reject the null hypothesis and conclude that there is a significant difference in mean alcohol concentration between the two wines.

Let us check how it varies for red and white wine.

Observations:

  1. White Wines have higher alcohol content.

  2. Alcohol has a strong positive correlation with quality.

ANOVA TEST FOR RED WINE

NULL HYPOTHESIS (H0): There is no significant difference in mean alcohol content across quality categories in red wine

ALTERNATE HYPOTHESIS (H1): There is significant difference in mean alcohol content across quality categories in red wine Df Sum Sq Mean Sq F value Pr(>F)
as.factor(quality) 5 438 87.6 103 <2e-16 *** Residuals 1347 1144 0.8
— Signif. codes: 0 ‘’ 0.001 ’’ 0.01 ’’ 0.05 ‘.’ 0.1 ’ ’ 1 Length Class Mode
as.factor(quality) 60 -none- numeric

Observations:

  1. From the ANOVA test, we can see that the P-vale is significantly less than 0.05. So, we reject the null hypothesis and conclude that there is a significant difference in mean alcohol content across all categories of red wine.

  2. We also have done Tukey Test to check in which quality ratings, there is a significant difference in alcohol level.

ANOVA TEST FOR WHITE WINE

NULL HYPOTHESIS (H0): There is no significant difference in mean alcohol content across quality categories in white wine

ALTERNATE HYPOTHESIS (H1): There is significant difference in mean alcohol content across quality categories in white wine

##                      Df Sum Sq Mean Sq F value Pr(>F)    
## as.factor(quality)    6   1467   244.5     220 <2e-16 ***
## Residuals          3935   4377     1.1                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = alcohol ~ as.factor(quality), data = white)
## 
## $`as.factor(quality)`
##       diff     lwr     upr p adj
## 4-3 -0.147 -0.8867  0.5934 0.997
## 5-3 -0.479 -1.1809  0.2223 0.405
## 6-3  0.304 -0.3957  1.0035 0.861
## 7-3  1.175  0.4692  1.8808 0.000
## 8-3  1.535  0.7882  2.2821 0.000
## 9-3  1.835  0.2794  3.3906 0.009
## 5-4 -0.333 -0.6009 -0.0644 0.005
## 6-4  0.451  0.1876  0.7134 0.000
## 7-4  1.322  1.0427  1.6006 0.000
## 8-4  1.682  1.3109  2.0527 0.000
## 9-4  1.982  0.5675  3.3957 0.001
## 6-5  0.783  0.6660  0.9003 0.000
## 7-5  1.654  1.5046  1.8040 0.000
## 8-5  2.014  1.7278  2.3011 0.000
## 9-5  2.314  0.9199  3.7087 0.000
## 7-6  0.871  0.7313  1.0111 0.000
## 8-6  1.231  0.9496  1.5130 0.000
## 9-6  1.531  0.1378  2.9245 0.020
## 8-7  0.360  0.0634  0.6568 0.006
## 9-7  0.660 -0.7365  2.0564 0.805
## 9-8  0.300 -1.1179  1.7176 0.996

Observations:

  1. From the ANOVA test, we can see that the P-vale is significantly less than 0.05. So, we reject the null hypothesis and conclude that there is a significant difference in mean alcohol content across all categories of red wine.

  2. We also have done Tukey Test to check in which quality ratings, there is a significant difference in alcohol level.


2. DENSITY VS QUALITY

The boxplot shows that wines with higher quality seem to have a less denser.

T-Test

WELCH TWO SAMPLE T-TEST

NULL HYPOTHESIS (H0): There is no significant mean difference between red and white wine in density level.

ALTERNATE HYPOTHESIS (H1): There is significant mean difference between the tw0 wines in density level. [1] 6.92e-322

Observations:

  1. Since the p-value of 7^{-322} is less than 0.05, we reject the null hypothesis and conclude that there is a significant difference in mean densitylevel between the two wines.

Let us check how it varies for red and white wine

Observations:

  1. Red Wines are more dense than white wines.

  2. Density has a negative correlation with quality.

ANOVA TEST FOR RED WINE

NULL HYPOTHESIS (H0): There is no significant difference in mean density level across quality categories in red wine.

ALTERNATE HYPOTHESIS (H1): There is significant difference in mean density level across quality categories in red wine.

##                      Df  Sum Sq  Mean Sq F value Pr(>F)    
## as.factor(quality)    5 0.00021 4.27e-05    12.8  4e-12 ***
## Residuals          1347 0.00451 3.40e-06                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = density ~ as.factor(quality), data = red, conf.level = 0.95)
## 
## $`as.factor(quality)`
##          diff       lwr       upr p adj
## 4-3 -0.000926 -0.002730  8.77e-04 0.686
## 5-3 -0.000380 -0.002047  1.29e-03 0.987
## 6-3 -0.000886 -0.002553  7.82e-04 0.654
## 7-3 -0.001413 -0.003114  2.88e-04 0.167
## 8-3 -0.002369 -0.004451 -2.87e-04 0.015
## 5-4  0.000546 -0.000210  1.30e-03 0.309
## 6-4  0.000041 -0.000718  8.00e-04 1.000
## 7-4 -0.000486 -0.001316  3.43e-04 0.550
## 8-4 -0.001442 -0.002902  1.73e-05 0.055
## 6-5 -0.000505 -0.000819 -1.91e-04 0.000
## 7-5 -0.001033 -0.001492 -5.73e-04 0.000
## 8-5 -0.001988 -0.003274 -7.03e-04 0.000
## 7-6 -0.000527 -0.000991 -6.39e-05 0.015
## 8-6 -0.001483 -0.002770 -1.96e-04 0.013
## 8-7 -0.000956 -0.002286  3.74e-04 0.314

Observations:

  1. From the ANOVA test, we can see that the P-vale is significantly less than 0.05. So, we reject the null hypothesis and conclude that there is a significant difference in mean density across all categories of red wine.

  2. We also have done Tukey Test to check in which quality ratings, there is a significant difference in density level.

ANOVA TEST FOR WHITE WINE

NULL HYPOTHESIS (H0): There is no significant difference in mean density level across quality categories in white wine.

ALTERNATE HYPOTHESIS (H1): There is significant difference in mean density level across quality categories in white wine.

                 Df  Sum Sq  Mean Sq F value Pr(>F)    

as.factor(quality) 6 0.00458 0.000764 105 <2e-16 *** Residuals 3935 0.02872 0.000007
— Signif. codes: 0 ‘’ 0.001 ’’ 0.01 ’’ 0.05 ‘.’ 0.1 ’ ’ 1 Tukey multiple comparisons of means 95% family-wise confidence level

Fit: aov(formula = density ~ as.factor(quality), data = white)

$as.factor(quality) diff lwr upr p adj 4-3 -0.000695 -0.00259 1.20e-03 0.934 5-3 0.000182 -0.00161 1.98e-03 1.000 6-3 -0.001160 -0.00295 6.33e-04 0.475 7-3 -0.002824 -0.00463 -1.02e-03 0.000 8-3 -0.003139 -0.00505 -1.23e-03 0.000 9-3 -0.003424 -0.00741 5.61e-04 0.147 5-4 0.000877 0.00019 1.56e-03 0.003 6-4 -0.000465 -0.00114 2.09e-04 0.392 7-4 -0.002129 -0.00284 -1.41e-03 0.000 8-4 -0.002444 -0.00339 -1.49e-03 0.000 9-4 -0.002729 -0.00635 8.93e-04 0.283 6-5 -0.001342 -0.00164 -1.04e-03 0.000 7-5 -0.003006 -0.00339 -2.62e-03 0.000 8-5 -0.003321 -0.00406 -2.59e-03 0.000 9-5 -0.003606 -0.00718 -3.44e-05 0.046 7-6 -0.001664 -0.00202 -1.31e-03 0.000 8-6 -0.001979 -0.00270 -1.26e-03 0.000 9-6 -0.002264 -0.00583 1.30e-03 0.500 8-7 -0.000315 -0.00108 4.45e-04 0.885 9-7 -0.000600 -0.00418 2.98e-03 0.999 9-8 -0.000285 -0.00392 3.35e-03 1.000

Observations:

  1. From the ANOVA test, we can see that the P-vale is significantly less than 0.05. So, we reject the null hypothesis and conclude that there is a significant difference in mean density level across all categories of red wine.

  2. We also have done Tukey Test to check in which quality ratings, there is a significant difference in density level.


3. CHLORIDES VS QUALITY

The boxplot shows that wines with higher quality seem to have a less chlorides.

T-Test

WELCH TWO SAMPLE T-TEST

NULL HYPOTHESIS (H0): There is no significant mean difference between red and white wine in chloride concentration.

ALTERNATE HYPOTHESIS (H1): There is significant mean difference between the tw0 wines in chloride concentration. [1] 3.71e-159

Observations:s

  1. Since the p-value of 3.706^{-159} is less than 0.05, we reject the null hypothesis and conclude that there is a significant difference in mean chloride concentration between the two wines.

Let us check how it varies for red and white wine

Observations:

  1. Red Wines have more chloride concentration than white wines.

  2. Chloride Concentration has a slight negative correlation with quality

ANOVA TEST FOR RED WINE

NULL HYPOTHESIS (H0): There is no significant difference in mean chloride contentration across quality categories in red wine.

ALTERNATE HYPOTHESIS (H1): There is significant difference in mean chloride contentration across quality categories in red wine.

                 Df Sum Sq Mean Sq F value  Pr(>F)    

as.factor(quality) 5 0.06 0.01286 5.34 7.2e-05 *** Residuals 1347 3.24 0.00241
— Signif. codes: 0 ‘’ 0.001 ’’ 0.01 ’’ 0.05 ‘.’ 0.1 ’ ’ 1 Tukey multiple comparisons of means 95% family-wise confidence level

Fit: aov(formula = chlorides ~ as.factor(quality), data = red, conf.level = 0.95)

$as.factor(quality) diff lwr upr p adj 4-3 -0.03244 -0.0808 1.59e-02 0.393 5-3 -0.02851 -0.0732 1.62e-02 0.452 6-3 -0.03730 -0.0820 7.40e-03 0.164 7-3 -0.04567 -0.0913 -7.67e-05 0.049 8-3 -0.05415 -0.1100 1.66e-03 0.063 5-4 0.00394 -0.0163 2.42e-02 0.994 6-4 -0.00485 -0.0252 1.55e-02 0.984 7-4 -0.01323 -0.0355 9.01e-03 0.534 8-4 -0.02170 -0.0608 1.74e-02 0.610 6-5 -0.00879 -0.0172 -3.65e-04 0.035 7-5 -0.01716 -0.0295 -4.85e-03 0.001 8-5 -0.02564 -0.0601 8.82e-03 0.276 7-6 -0.00837 -0.0208 4.05e-03 0.389 8-6 -0.01685 -0.0514 1.77e-02 0.731 8-7 -0.00848 -0.0441 2.72e-02 0.984

Observations

  1. From the ANOVA test, we can see that the P-vale is significantly less than 0.05. So, we reject the null hypothesis and conclude that there is a significant difference in mean chloride concentration across all categories of red wine.

  2. We also have done Tukey Test to check in which quality ratings, there is a significant differnece in chloride concentration.

ANOVA TEST FOR WHITE WINE

NULL HYPOTHESIS (H0): There is no significant difference in mean chloride contentration across quality categories in white wine.

ALTERNATE HYPOTHESIS (H1): There is significant difference in mean chloride contentration across quality categories in white wine.

                 Df Sum Sq Mean Sq F value Pr(>F)    

as.factor(quality) 6 0.114 0.0190 37.7 <2e-16 *** Residuals 3935 1.987 0.0005
— Signif. codes: 0 ‘’ 0.001 ’’ 0.01 ’’ 0.05 ‘.’ 0.1 ’ ’ 1 Tukey multiple comparisons of means 95% family-wise confidence level

Fit: aov(formula = chlorides ~ as.factor(quality), data = white)

$as.factor(quality) diff lwr upr p adj 4-3 -0.004162 -0.01993 0.011603 0.987 5-3 -0.001961 -0.01691 0.012985 1.000 6-3 -0.009171 -0.02407 0.005733 0.538 7-3 -0.016760 -0.03180 -0.001725 0.018 8-3 -0.017544 -0.03346 -0.001633 0.020 9-3 -0.026900 -0.06004 0.006239 0.201 5-4 0.002201 -0.00351 0.007915 0.917 6-4 -0.005009 -0.01061 0.000592 0.115 7-4 -0.012598 -0.01854 -0.006655 0.000 8-4 -0.013382 -0.02128 -0.005481 0.000 9-4 -0.022738 -0.05286 0.007386 0.281 6-5 -0.007210 -0.00970 -0.004714 0.000 7-5 -0.014799 -0.01799 -0.011609 0.000 8-5 -0.015583 -0.02169 -0.009476 0.000 9-5 -0.024939 -0.05464 0.004765 0.168 7-6 -0.007589 -0.01057 -0.004609 0.000 8-6 -0.008373 -0.01437 -0.002373 0.001 9-6 -0.017729 -0.04741 0.011953 0.574 8-7 -0.000784 -0.00710 0.005536 1.000 9-7 -0.010140 -0.03989 0.019609 0.953 9-8 -0.009356 -0.03956 0.020846 0.970

Observations:

  1. From the ANOVA test, we can see that the P-vale is significantly less than 0.05. So, we reject the null hypothesis and conclude that there is a significant difference in mean chloride concentration across all categories of red wine.

  2. We also have done Tukey Test to check in which quality ratings, there is a significant difference in chloride concentration.


4. CITRIC ACID VS QUALITY

The boxplot shows that wines with higher quality seem to have a high citric acid.

T-Test

WELCH TWO SAMPLE T-TEST

NULL HYPOTHESIS (H0): There is no significant mean difference between red and white wine in citric acid concentration.

ALTERNATE HYPOTHESIS (H1): There is significant mean difference between the tw0 wines in citric acid concentration. [1] 1.33e-26

Observations:

  1. Since the p-value of 1.335^{-26} is less than 0.05, we reject the null hypothesis and conclude that there is a significant difference in mean citric acid concentration between the two wines.

Let us check how it varies for red and white wine

Observations:

  1. White Wines have more citric acid concentration than red wines.

  2. Citric ACid Concentration has a slight positive correlation with quality.

  3. There isn’t much difference in citric acid concentration in white wines across the quality ratings.

ANOVA TEST FOR RED WINE

NULL HYPOTHESIS (H0): There is no significant difference in mean citric acid level across quality categories in red wine.

ALTERNATE HYPOTHESIS (H1): There is significant difference in mean citric acid level across quality categories in red wine. Df Sum Sq Mean Sq F value Pr(>F)
as.factor(quality) 5 2.9 0.584 16.1 1.8e-15 *** Residuals 1347 48.8 0.036
— Signif. codes: 0 ‘’ 0.001 ’’ 0.01 ’’ 0.05 ‘.’ 0.1 ’ ’ 1 Tukey multiple comparisons of means 95% family-wise confidence level

Fit: aov(formula = citric.acid ~ as.factor(quality), data = red, conf.level = 0.95)

$as.factor(quality) diff lwr upr p adj 4-3 0.00477 -0.18279 0.1923 1.000 5-3 0.07377 -0.09948 0.2470 0.830 6-3 0.10949 -0.06389 0.2829 0.465 7-3 0.20086 0.02402 0.3777 0.015 8-3 0.21194 -0.00453 0.4284 0.059 5-4 0.06901 -0.00965 0.1477 0.124 6-4 0.10472 0.02579 0.1836 0.002 7-4 0.19609 0.10983 0.2823 0.000 8-4 0.20717 0.05542 0.3589 0.001 6-5 0.03572 0.00304 0.0684 0.023 7-5 0.12708 0.07934 0.1748 0.000 8-5 0.13817 0.00450 0.2718 0.038 7-6 0.09137 0.04318 0.1396 0.000 8-6 0.10245 -0.03138 0.2363 0.246 8-7 0.01108 -0.12720 0.1494 1.000

Observations:

  1. From the ANOVA test, we can see that the P-vale is significantly less than 0.05. So, we reject the null hypothesis and conclude that there is a significant difference in mean citric acid across all categories of red wine.

  2. We also have done Tukey Test to check in which quality ratings, there is a significant difference in citric acid level.

ANOVA TEST FOR WHITE WINE

NULL HYPOTHESIS (H0): There is no significant difference in mean citric acid level across quality categories in white wine.

ALTERNATE HYPOTHESIS (H1): There is significant difference in mean citric acid level across quality categories in white wine.

                 Df Sum Sq Mean Sq F value Pr(>F)  

as.factor(quality) 6 0.2 0.0363 2.43 0.024 * Residuals 3935 58.8 0.0150
— Signif. codes: 0 ‘’ 0.001 ’’ 0.01 ’’ 0.05 ‘.’ 0.1 ’ ’ 1 Tukey multiple comparisons of means 95% family-wise confidence level

Fit: aov(formula = citric.acid ~ as.factor(quality), data = white)

$as.factor(quality) diff lwr upr p adj 4-3 -0.031789 -0.117583 0.05400 0.930 5-3 0.000279 -0.081059 0.08162 1.000 6-3 0.002416 -0.078687 0.08352 1.000 7-3 -0.008569 -0.090389 0.07325 1.000 8-3 -0.000962 -0.087551 0.08563 1.000 9-3 0.050000 -0.130341 0.23034 0.983 5-4 0.032068 0.000969 0.06317 0.038 6-4 0.034205 0.003727 0.06468 0.016 7-4 0.023220 -0.009118 0.05556 0.342 8-4 0.030828 -0.012172 0.07383 0.344 9-4 0.081789 -0.082144 0.24572 0.762 6-5 0.002137 -0.011441 0.01572 0.999 7-5 -0.008848 -0.026203 0.00851 0.743 8-5 -0.001241 -0.034472 0.03199 1.000 9-5 0.049721 -0.111925 0.21137 0.972 7-6 -0.010985 -0.027202 0.00523 0.416 8-6 -0.003378 -0.036030 0.02927 1.000 9-6 0.047584 -0.113944 0.20911 0.977 8-7 0.007608 -0.026787 0.04200 0.995 9-7 0.058569 -0.103320 0.22046 0.938 9-8 0.050962 -0.113390 0.21531 0.970

Observations:

  1. From the ANOVA test, we can see that the P-vale is significantly less than 0.05. So, we reject the null hypothesis and conclude that there is a significant difference in mean citric acid level across all categories of red wine.

CORRELATION FOR THE BI-VARIATE PLOTS.

Observations:

Data cleaning is performed on both white and red in the above code, deleting rows with missing values in certain columns. Correlation matrices are subsequently computed for the cleaned datasets. The correlation matrices from the white and red wine datasets are averaged in the combined correlation matrix. A correlation plot with a customized appearance is built.


SUMMARY OF BIVARIATE PLOTS

From the bivariate plots, we have concluded that :

  1. For good quality wines, the alcohol content is more.

  2. As the density decreases, the quality gets better.

  3. As the chloride concentration decreases, quality gets better.

  4. For better wines, the citric acid concentration is more.

  5. Whites wines generally have less density and have more alcohol conent.


MULTIVARIATE PLOTS

For the last part of our EDA, we will perform some multivariate plots to see some how the other non-important featues in wine are distributed in red and white wine.

Observations:

  1. White Wines have more sugar concentration than red wines. This might explain why white wines are usually more sweet.

  2. Red Wines have more sulphate concentration.


LIMITATIONS

Although we thoroughly believe in our analysis, we have to mention a few anomalies that are present that may or may not have influenced the results.

  1. The number of data that we have on white wine is comparatively more than that of red wine.

  2. We have also observed that most of the data present are in the average quality range i.e, from 4-8.

For a better analysis, we need a more balanced data.


CONCLUSIONS FROM EDA

In our preliminary analysis, we have uncovered some key insights:

  1. Wines with elevated alcohol content, increased citric acid levels, lower density, and reduced chlorides tend to exhibit higher quality.

  2. It appears that white wines, in general, tend to be sweeter and have higher alcohol content when compared to their red counterparts.

With this new found knowledge, you’re better equipped to make informed choices when selecting wines!!


MODELLING THE DATA

From our EDA, we have identified some key features which influence the target variable- Quality.

As part of our next analysis, we will now perform some regression models to see which model predicts the quality of wine with most accuracy.


wine2<-wine

1. LINEAR REGRESSION

loadPkg("leaps")
reg.best2 <- regsubsets(quality~., data = numeric_data, nbest = 2, method = "exhaustive") 
plot(reg.best2, scale = "adjr2", main = "Adjusted R^2")

plot(reg.best2, scale = "r2", main = "R^2")

plot(reg.best2, scale = "bic", main = "BIC")

plot(reg.best2, scale = "Cp", main = "Cp")

model1 = lm(quality ~ volatile.acidity+residual.sugar+free.sulfur.dioxide+total.sulfur.dioxide+pH+sulphates+alcohol, data=numeric_data)
sum_md1 = summary(model1) # also for later use on inline codes
sum_md1
## 
## Call:
## lm(formula = quality ~ volatile.acidity + residual.sugar + free.sulfur.dioxide + 
##     total.sulfur.dioxide + pH + sulphates + alcohol, data = numeric_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.992 -0.450 -0.024  0.462  3.144 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.391874   0.240127    5.80  7.2e-09 ***
## volatile.acidity     -1.390559   0.068586  -20.27  < 2e-16 ***
## residual.sugar        0.017460   0.002675    6.53  7.3e-11 ***
## free.sulfur.dioxide   0.006692   0.000825    8.11  6.4e-16 ***
## total.sulfur.dioxide -0.002224   0.000285   -7.81  6.7e-15 ***
## pH                    0.273930   0.066893    4.10  4.3e-05 ***
## sulphates             0.620295   0.071304    8.70  < 2e-16 ***
## alcohol               0.344362   0.009197   37.44  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.734 on 5287 degrees of freedom
## Multiple R-squared:  0.305,  Adjusted R-squared:  0.304 
## F-statistic:  331 on 7 and 5287 DF,  p-value: <2e-16
predicted_values <- predict(model1, newdata = numeric_data)

# Calculate residuals (the differences between predicted and actual values)
residuals <- predicted_values - numeric_data$quality

# Calculate Mean Squared Error (MSE)
mse <- mean(residuals^2)
print(sqrt(0.53))
## [1] 0.728
cat("MSE",mse,"\n")
## MSE 0.538

2. LOGISTIC REGRESSION

For Logistic Regression, we want the outcome variable to be categorical. Hence,w e are converting all the wines from 0 to 6 to “Bad Wines” and from 6-10 as “Good Wines”.

wine <- wine %>%
  mutate(quality_category = ifelse(quality < 6, 0, 1))
head(wine)
type fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol quality quality_category
1 white 7.0 0.27 0.36 20.7 0.045 45 170 1.001 3.00 0.45 8.8 6 1
2 white 6.3 0.30 0.34 1.6 0.049 14 132 0.994 3.30 0.49 9.5 6 1
3 white 8.1 0.28 0.40 6.9 0.050 30 97 0.995 3.26 0.44 10.1 6 1
4 white 7.2 0.23 0.32 8.5 0.058 47 186 0.996 3.19 0.40 9.9 6 1
7 white 6.2 0.32 0.16 7.0 0.045 30 136 0.995 3.18 0.47 9.6 6 1
10 white 8.1 0.22 0.43 1.5 0.044 28 129 0.994 3.22 0.45 11.0 6 1
winelogit1 <- glm(quality_category ~volatile.acidity+residual.sugar+pH+sulphates+alcohol+free.sulfur.dioxide+total.sulfur.dioxide, data = wine, binomial(link = "logit"))
summary(winelogit1)
## 
## Call:
## glm(formula = quality_category ~ volatile.acidity + residual.sugar + 
##     pH + sulphates + alcohol + free.sulfur.dioxide + total.sulfur.dioxide, 
##     family = binomial(link = "logit"), data = wine)
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -1.17e+01   8.29e-01  -14.07  < 2e-16 ***
## volatile.acidity     -4.28e+00   2.45e-01  -17.45  < 2e-16 ***
## residual.sugar        5.22e-02   8.64e-03    6.04  1.5e-09 ***
## pH                    8.50e-01   2.24e-01    3.80  0.00015 ***
## sulphates             1.89e+00   2.48e-01    7.64  2.1e-14 ***
## alcohol               9.57e-01   3.71e-02   25.78  < 2e-16 ***
## free.sulfur.dioxide   1.84e-02   2.80e-03    6.59  4.5e-11 ***
## total.sulfur.dioxide -7.04e-03   9.27e-04   -7.59  3.1e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6999.2  on 5294  degrees of freedom
## Residual deviance: 5444.6  on 5287  degrees of freedom
## AIC: 5461
## 
## Number of Fisher Scoring iterations: 4
exp(coef(winelogit1))
##          (Intercept)     volatile.acidity       residual.sugar 
##             8.53e-06             1.39e-02             1.05e+00 
##                   pH            sulphates              alcohol 
##             2.34e+00             6.65e+00             2.60e+00 
##  free.sulfur.dioxide total.sulfur.dioxide 
##             1.02e+00             9.93e-01
loadPkg(regclass)
conf_matrix = confusion_matrix(winelogit1)

unloadPkg(regclass)
xkabledply(conf_matrix,"Confusion Matrix: Logit model, cutoff = 0.6")
Confusion Matrix: Logit model, cutoff = 0.6
Predicted 0 Predicted 1 Total
Actual 0 1185 794 1979
Actual 1 530 2786 3316
Total 1715 3580 5295
#install.packages("pROC")

loadPkg(pROC) # receiver operating characteristic curve, gives the diagnostic ability of a binary classifier system as its discrimination threshold is varied. The curve is on sensitivity/recall/true-positive-rate vs false_alarm/false-positive-rate/fall-out.
wine$prob=predict(winelogit1, type = c("response")) # Add new column of predicted probabilities
h <- roc(quality_category~prob, data=wine)
auc(h) # area-under-curve prefer 0.8 or higher.

Area under the curve: 0.808

plot(h)

#unloadPkg(pROC) 
#### Hosmer and Lemeshow test  
########The Hosmer and Lemeshow Goodness of Fit test can be used to evaluate logistic regression fit. 

loadPkg(ResourceSelection) # function hoslem.test( ) for logit model evaluation
winelogit1_Hoslem = hoslem.test(wine$quality_category, fitted(winelogit1)) # Hosmer and Lemeshow test, a chi-squared test
unloadPkg(ResourceSelection) 
winelogit1_Hoslem
Hosmer and Lemeshow goodness of fit (GOF) test

data: wine$quality_category, fitted(winelogit1) X-squared = 11, df = 8, p-value = 0.2

# We can try this just to confirm the following calculation
wine_nulllogit1 <- glm(quality_category ~ 1, data = wine, family = "binomial")
mcFadden = 1 - logLik(winelogit1)/logLik(wine_nulllogit1)
mcFadden

‘log Lik.’ 0.222 (df=8)

loadPkg(pscl) # use pR2( ) function to calculate McFadden statistics for model eval
wineLogit1pr2 = pR2(winelogit1)
## fitting null model for pseudo-r2
wineLogit1pr2
##       llh   llhNull        G2  McFadden      r2ML      r2CU 
## -2722.313 -3499.575  1554.526     0.222     0.254     0.347
unloadPkg(pscl) 
true_positives <- 2786
true_negatives <- 1185
false_positives <-794
false_negatives <-530
total <- 5295

# Calculate accuracy
accuracy <- (true_positives + true_negatives) / total
cat("Accuracy of logistic regression model is", accuracy,"\n")

Accuracy of logistic regression model is 0.75

precision <- true_positives / (true_positives + false_positives)
cat(" Precision of logistic regression model is",  precision, "\n")

Precision of logistic regression model is 0.778


3. KNN

loadPkg("class")
library(class)
set.seed(123)

sample_indices <- sample(1:nrow(wine), 0.7 * nrow(wine))

train_data <- wine[sample_indices, ]
test_data <- wine[-sample_indices, ]

library(class)

# Define predictors and response variable
predictors <- names(wine)[2:13]  # Adjust column indices based on your data
response <- "quality"
train_data$quality_category <- as.factor(train_data$quality_category)


k <- 5  # You can adjust the value of k
knn_model <- knn(train = train_data[, predictors], test = test_data[, predictors], cl = train_data$quality_category, k = k)


confusion_matrix <- table(Actual = test_data$quality_category, Predicted = knn_model)
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
precision <-confusion_matrix[2,2]/sum(diag(confusion_matrix))
confusion_matrix
##       Predicted
## Actual   0   1
##      0 329 269
##      1 183 808
# Display the confusion matrix and accuracy
print(confusion_matrix)
##       Predicted
## Actual   0   1
##      0 329 269
##      1 183 808
cat("Accuracy:", accuracy, "\n")
## Accuracy: 0.716
cat("Precision :", precision, "\n")
## Precision : 0.711
str(knn_model)
##  Factor w/ 2 levels "0","1": 2 1 2 1 2 2 2 2 2 2 ...
length(knn_model)
## [1] 1589
table(knn_model)
## knn_model
##    0    1 
##  512 1077
library(caret)
set.seed(123)
cv <- trainControl(method = "cv", number = 10)
model <- train(quality_category ~ ., data = train_data, method = "knn", trControl = cv)
model
## k-Nearest Neighbors 
## 
## 3706 samples
##   14 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 3334, 3335, 3335, 3336, 3336, 3335, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy  Kappa
##   5  0.712     0.361
##   7  0.711     0.357
##   9  0.706     0.340
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.

4. DECISION TREES


library(rpart)
set.seed(123) 
trainIndex <- sample(nrow(wine), 0.9 * nrow(wine))
trainData <- wine[trainIndex, ]
testData <- wine[-trainIndex, ]

tree_model <- rpart(quality ~ fixed.acidity + volatile.acidity + residual.sugar + free.sulfur.dioxide + density + pH + sulphates + alcohol, data = trainData)

print(summary(tree_model))
## Call:
## rpart(formula = quality ~ fixed.acidity + volatile.acidity + 
##     residual.sugar + free.sulfur.dioxide + density + pH + sulphates + 
##     alcohol, data = trainData)
##   n= 4765 
## 
##       CP nsplit rel error xerror   xstd
## 1 0.1665      0     1.000  1.000 0.0220
## 2 0.0338      1     0.834  0.844 0.0207
## 3 0.0321      2     0.800  0.811 0.0206
## 4 0.0136      3     0.768  0.789 0.0199
## 5 0.0100      4     0.754  0.783 0.0197
## 
## Variable importance
##             alcohol             density    volatile.acidity           sulphates 
##                  49                  25                  12                   5 
##       fixed.acidity      residual.sugar free.sulfur.dioxide                  pH 
##                   4                   2                   2                   1 
## 
## Node number 1: 4765 observations,    complexity param=0.166
##   mean=5.8, MSE=0.781 
##   left son=2 (2767 obs) right son=3 (1998 obs)
##   Primary splits:
##       alcohol             < 10.6  to the left,  improve=0.1660, (0 missing)
##       density             < 0.992 to the right, improve=0.1030, (0 missing)
##       volatile.acidity    < 0.458 to the right, improve=0.0519, (0 missing)
##       free.sulfur.dioxide < 18.5  to the left,  improve=0.0223, (0 missing)
##       sulphates           < 0.685 to the left,  improve=0.0132, (0 missing)
##   Surrogate splits:
##       density        < 0.993 to the right, agree=0.790, adj=0.500, (0 split)
##       sulphates      < 0.405 to the right, agree=0.607, adj=0.062, (0 split)
##       fixed.acidity  < 6.05  to the right, agree=0.603, adj=0.054, (0 split)
##       residual.sugar < 5.28  to the right, agree=0.586, adj=0.014, (0 split)
##       pH             < 3.54  to the left,  agree=0.583, adj=0.007, (0 split)
## 
## Node number 2: 2767 observations,    complexity param=0.0338
##   mean=5.49, MSE=0.553 
##   left son=4 (1498 obs) right son=5 (1269 obs)
##   Primary splits:
##       volatile.acidity    < 0.282 to the right, improve=0.08210, (0 missing)
##       alcohol             < 10.1  to the left,  improve=0.03980, (0 missing)
##       free.sulfur.dioxide < 24.5  to the left,  improve=0.02160, (0 missing)
##       density             < 0.995 to the right, improve=0.00820, (0 missing)
##       sulphates           < 0.675 to the left,  improve=0.00746, (0 missing)
##   Surrogate splits:
##       free.sulfur.dioxide < 29.5  to the left,  agree=0.654, adj=0.246, (0 split)
##       density             < 0.995 to the right, agree=0.634, adj=0.203, (0 split)
##       sulphates           < 0.505 to the right, agree=0.628, adj=0.190, (0 split)
##       residual.sugar      < 4.15  to the left,  agree=0.625, adj=0.182, (0 split)
##       fixed.acidity       < 6.85  to the right, agree=0.598, adj=0.124, (0 split)
## 
## Node number 3: 1998 observations,    complexity param=0.0321
##   mean=6.22, MSE=0.786 
##   left son=6 (1100 obs) right son=7 (898 obs)
##   Primary splits:
##       alcohol             < 11.7  to the left,  improve=0.0760, (0 missing)
##       free.sulfur.dioxide < 11.5  to the left,  improve=0.0442, (0 missing)
##       volatile.acidity    < 0.587 to the right, improve=0.0405, (0 missing)
##       density             < 0.991 to the right, improve=0.0311, (0 missing)
##       fixed.acidity       < 7.05  to the right, improve=0.0117, (0 missing)
##   Surrogate splits:
##       density          < 0.991 to the right, agree=0.712, adj=0.359, (0 split)
##       volatile.acidity < 0.282 to the left,  agree=0.582, adj=0.069, (0 split)
##       sulphates        < 0.365 to the right, agree=0.573, adj=0.050, (0 split)
##       fixed.acidity    < 5.95  to the right, agree=0.571, adj=0.046, (0 split)
##       residual.sugar   < 13.6  to the left,  agree=0.553, adj=0.006, (0 split)
## 
## Node number 4: 1498 observations
##   mean=5.3, MSE=0.443 
## 
## Node number 5: 1269 observations
##   mean=5.72, MSE=0.583 
## 
## Node number 6: 1100 observations,    complexity param=0.0136
##   mean=6, MSE=0.768 
##   left son=12 (174 obs) right son=13 (926 obs)
##   Primary splits:
##       volatile.acidity    < 0.455 to the right, improve=0.0600, (0 missing)
##       free.sulfur.dioxide < 11.5  to the left,  improve=0.0533, (0 missing)
##       pH                  < 3.5   to the right, improve=0.0157, (0 missing)
##       sulphates           < 0.675 to the left,  improve=0.0154, (0 missing)
##       fixed.acidity       < 7.05  to the right, improve=0.0129, (0 missing)
##   Surrogate splits:
##       pH            < 3.5   to the right, agree=0.853, adj=0.069, (0 split)
##       density       < 1     to the right, agree=0.847, adj=0.034, (0 split)
##       fixed.acidity < 14.9  to the right, agree=0.844, adj=0.011, (0 split)
## 
## Node number 7: 898 observations
##   mean=6.49, MSE=0.675 
## 
## Node number 12: 174 observations
##   mean=5.51, MSE=0.79 
## 
## Node number 13: 926 observations
##   mean=6.09, MSE=0.709 
## 
## n= 4765 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 4765 3720 5.80  
##    2) alcohol< 10.6 2767 1530 5.49  
##      4) volatile.acidity>=0.282 1498  664 5.30 *
##      5) volatile.acidity< 0.282 1269  740 5.72 *
##    3) alcohol>=10.6 1998 1570 6.22  
##      6) alcohol< 11.7 1100  845 6.00  
##       12) volatile.acidity>=0.455 174  137 5.51 *
##       13) volatile.acidity< 0.455 926  657 6.09 *
##      7) alcohol>=11.7 898  606 6.49 *
library(rpart.plot)
rpart.plot(tree_model, digits = 4, extra = 1)

The root node splits the full dataset based on the alcohol content of the wines, with a split point of 10.6. Wines with alcohol < 10.6 go to the left branch and wines with alcohol >= 10.6 go to the right branch. The left branch then further splits on volatile acidity being < or >= 0.282. This results in two leaf nodes predicting quality scores of 5.30 and 5.72 for the low and high volatility acidity wines. The right branch splits again on alcohol content < or >= 11.7. The left sub-branch splits once more on volatile acidity to generate two leaf nodes predicting quality of 5.51 and 6.09. The right-most leaf node predicts a quality score of 6.49 for wines with the highest alcohol content >= 11.7. So in total it has generated 5 leaf nodes predicting quality scores ranging from 5.30 to 6.49 based on 3 split points on alcohol content and 1 split point on volatile acidity.